Skip to content

Conversation

ldorau
Copy link
Contributor

@ldorau ldorau commented Sep 12, 2025

urDeviceRetain(MDevice) should not be called
in the constructor of device_impl in case of subdevices,
because their reference counter does not drop to zero
and subdevices are not destroyed nor freed now,
what causes memory leaks.

It fixes URT-961.

@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from b07c1d4 to 51a4b37 Compare September 12, 2025 13:43
@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 51a4b37 to ab4b9b9 Compare September 12, 2025 13:48
@ldorau ldorau changed the title [UR] Do not call urDeviceRetain() in constructor of device_impl [UR] Do not call urDeviceRetain() in case of subdevices Sep 12, 2025
@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from ab4b9b9 to 7c033d0 Compare September 15, 2025 10:05
@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 7c033d0 to 054acf1 Compare September 15, 2025 10:15
@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 054acf1 to 7eb40c4 Compare September 15, 2025 15:27
@ldorau ldorau changed the title [UR] Do not call urDeviceRetain() in case of subdevices [SYCL] Do not call urDeviceRetain() in case of subdevices Sep 15, 2025
@ldorau ldorau changed the title [SYCL] Do not call urDeviceRetain() in case of subdevices [SYCL] Do not call urDeviceRetain() in case of subdevices Sep 15, 2025
@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 7eb40c4 to 1e745dc Compare September 15, 2025 15:33
@ldorau ldorau marked this pull request as ready for review September 15, 2025 15:34
@ldorau ldorau requested a review from a team as a code owner September 15, 2025 15:34
@ldorau ldorau requested a review from againull September 15, 2025 15:34
@ldorau
Copy link
Contributor Author

ldorau commented Sep 15, 2025

@pbalcer please review

@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 1e745dc to 1ab563c Compare September 15, 2025 16:21
: get_info_impl<UR_DEVICE_INFO_PARENT_DEVICE>()),
// TODO catch an exception and put it to list of asynchronous exceptions:
MCache{*this} {
if (IsSubDevice) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The early return seems abrupt. It also loses your intention. Why not just say if( !IsSubDevice){ urDeviceRetain } then it's clear WHY we are tracking isSubDevice .

Otherwise, someone in the future could easily refactor this code and the retain could end up higher in the function and the issue reintroduced.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@cperkinsintel
Copy link
Contributor

@ldorau - please do not "force push". That messes up the continuity of the reviews on github. If you are having to "force push" it probably means you are doing something wrong on your end.

@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from 1ab563c to 1344429 Compare September 15, 2025 16:44
@ldorau
Copy link
Contributor Author

ldorau commented Sep 15, 2025

@ldorau - please do not "force push". That messes up the continuity of the reviews on github. If you are having to "force push" it probably means you are doing something wrong on your end.

OK, so what should I do? Add more commits titled:

  • Fixes after review no. 1
  • Fixes after review no. 2
  • Fixes after review no. 3
    ...

?
Will they be squashed before merging?

urDeviceRetain(MDevice) should not be called
in the constructor of device_impl in case of subdevices,
because their reference counter does not drop to zero
and subdevices are not destroyed nor freed now,
what causes memory leaks.

It fixes URT-961.

Signed-off-by: Lukasz Dorau <[email protected]>
}

device_impl &
platform_impl::getOrMakeSubDeviceImpl(ur_device_handle_t UrSubDevice) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please implement a helper function that taken an IsSubDevice bool to eliminate duplication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ldorau ldorau force-pushed the Do_not_call_DeviceRetain_in_constructor_of_device_impl branch from b6f2cce to 119f6a2 Compare September 16, 2025 09:12
@ldorau ldorau requested a review from pbalcer September 16, 2025 09:13
@ldorau
Copy link
Contributor Author

ldorau commented Sep 16, 2025

@ldorau - please do not "force push". That messes up the continuity of the reviews on github. If you are having to "force push" it probably means you are doing something wrong on your end.

@cperkinsintel OK, a new commit added with fixes

Copy link
Contributor

@intel/llvm-gatekeepers please consider merging

Comment on lines +33 to +37
if (!IsSubDevice) {
// Interoperability Constructor already calls DeviceRetain in
// urDeviceCreateWithNativeHandle.
getAdapter().call<UrApiKind::urDeviceRetain>(MDevice);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The destructor device_impl::~device_impl calls urDeviceRelease, which is meant to balance the urDeviceRetain in the constructor. However, after this change, urDeviceRetain is no longer called for subdevices in the constructor, while urDeviceRelease is still invoked for them in the destructor. This imbalance is somewhat concerning. It seems that the actual root cause of the memory leak may lie elsewhere. Could you please clarify what you believe the underlying issue is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that when a device is created RefCount is equal 1 without any call to retain(), but it is destroyed when RefCount is equal 0, so there should be one more call to release() than a number of calls to retain() ... or the implementation of class RefCount is wrong ?

class RefCount {
public:
  RefCount(uint32_t count = 1) : Count(count) {}
  ...
  uint32_t retain() { return ++Count; }
  bool release() { return --Count == 0; }
  void reset(uint32_t value = 1) { Count = value; }

private:
  std::atomic_uint32_t Count;
};

...

ur_result_t urDeviceRelease(ur_device_handle_t Device) {
  // Root devices are destroyed during the piTearDown process.
  if (Device->isSubDevice()) {
    if (Device->RefCount.release()) {
      delete Device;
    }
  }

  return UR_RESULT_SUCCESS;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@againull againull Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldorau @pbalcer In such case, why we differentiate between root device and sub-device in this case? can we just remove retain from the constructor completely? (only sub-devices being ref counted seems like implementation detail of the adapter)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"(only sub-devices being ref counted seems like implementation detail of the adapter)"

That's the reason. I pointed it out because it's possible that the existing sequence of creates/retains/releases was always broken, and we never noticed because the leak happens in the rare case of subdevices. So in a way, I'm agreeing with your earlier comment.

@ldorau ldorau requested a review from againull September 17, 2025 09:45
Copy link
Contributor

@intel/llvm-gatekeepers please consider merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants